Abstract

In this paper, we examine the fundamental trade-off between radiated power and achieved throughput in wireless 
multi-carrier, multiple-input and multiple-output (MIMO) systems that vary with time in an unpredictable fashion (e.g. 
due to changes in the wireless medium or the users’ QoS requirements). Contrary to the static/stationary channel regime, 
there is no optimal power allocation profile to target (either static or in the mean), so the system’s users must adapt to 
changes in the environment “on the fly”, without being able to predict the system’s evolution ahead of time. In this 
dynamic context, we formulate the users’ power/throughput trade-off as an online optimization problem, and we provide 
a matrix exponential learning algorithm that leads to no regret - i.e. the proposed transmit policy is asymptotically 
optimal in hindsight, irrespective of how the system evolves over time. Furthermore, we also examine the robustness 
of the proposed algorithm under imperfect channel state information (CSI) and we show that it retains its regret 
minimization properties under very mild conditions on the measurement noise statistics. As a result, users are able to 
track the evolution of their individually optimum transmit profiles remarkably well, even under rapidly changing network 
conditions and high uncertainty. Our theoretical analysis is validated by extensive numerical simulations corresponding
to a realistic network deployment, which also provide further insight into the practical implementation aspects of the proposed
algorithm.


Index Terms 

Power allocation; MIMO; OFDMA; online optimization; no regret; matrix exponential learning. 

I. Introduction 

The wildfire spread of Internet-enabled mobile devices is putting existing wireless systems under enormous strain 
and is one of the driving forces behind the transition to next-generation mobile networks [1]. In this context, the 
efficient control and allocation of radiated power comprises an indispensable aspect of wireless system design: for
many applications (such as e-mail and voice calls), radiated power must be reduced to the bare minimum in order 
to preserve battery life; by contrast, for rate-hungry applications (such as multimedia streaming and video calling), 
it is crucial to optimize the allocation of the users’ limited power across the network’s degrees of freedom so as to 
maximize their throughput. In this way, wireless users are facing an important trade-off between radiated power and
achieved throughput which must often be resolved in an adaptive and distributed manner, with minimal coordination 
between users. 

In its most basic form, power control (PC) allows wireless links to achieve a target throughput while minimizing 
radiated power and the induced co-channel interference (CCI). Accordingly, power control has had a pivotal impact 
on wireless system design and operation ever since the early development stages of legacy wireless networks: starting 
with the pioneering work of Zander [2], Grandhi et al. [3], Foschini and Miljanic [4] and Yates [5], the design of 
efficient power control algorithms has given rise to a vast and extremely active corpus of literature - see e.g. [6] for a 
survey. Thus, in view of recent advances in MIMO technologies and the prolific deployment of orthogonal frequency- 
division multiple access (OFDMA) schemes, the envisioned transition to 5th generation (5G) mobile systems calls for 
power control algorithms tailored to networks with several degrees of freedom (spectral as well as spatial). 

In this setting, most of the relevant literature has focused on maximizing the users’ achievable transmission rate 
subject to their individual power constraints: [7-9] treat rate maximization as a constrained nonlinear optimization 
problem whereas [10-12] focus on multiple user interactions using game-theoretic methods; in a similar vein, [13- 
15] studied the power minimization problem subject to the users’ rate requirements in multi-carrier multiple access 
channels (MACs), while [16] provided a two-layer framework for power minimization in MIMO-OFDMA systems. 
However, while the benefits of power control algorithms are relatively easy to assess in static networks, it is much 
harder to analyze their behavior in wireless systems that vary with time (e.g. due to user mobility, fading, temporal 
variations in the wireless medium, etc.). In the ergodic regime (where the users’ channels follow a stationary ergodic 
process), [17, 18] provided power control algorithms that minimize the users’ transmit power while achieving a minimum
ergodic rate requirement. More recently, the authors of [19] studied the problem of ergodic rate maximization in
fast-fading multi-carrier systems and they provided an efficient power allocation algorithm that allows users to attain 
the system’s (ergodic) sum-capacity. However, when the wireless medium does not evolve according to an independent 
and identically distributed (i.i.d.) sequence of random variables, the efficient allocation and control of radiated power 
remains a very open issue. 

In this paper, we drop all stationarity/i.i.d. assumptions and we focus squarely on wireless systems that evolve 
arbitrarily over time in terms of both channel conditions and user quality of service (QoS) requirements. In this 
framework, standard approaches based on linear programming (for static channels) and/or stochastic optimization 
(for the ergodic regime) are no longer relevant because there is no underlying optimization problem to solve - either 
static or in the mean. Instead, we treat power control as a dynamically evolving optimization problem and we employ 
techniques and ideas from online learning and optimization [20] to quantify how well the system’s users can adapt to 
changes in the wireless medium. 

The most widely used performance criterion in this setting is that of regret minimization, a seminal concept which
was first introduced by Hannan [21] and which has since given rise to a vigorous literature at the interface of machine
learning, optimization, statistics, and game theory - for a comprehensive survey, see e.g. [20, 22]. Specifically, in
the language of game theory, the notion of regret compares a user’s cumulative payoff over a given time horizon to 
the cumulative payoff that he would have obtained by employing the a posteriori best possible action over the time 
horizon in question. Accordingly, in the context of power allocation and control, regret minimization corresponds to 
dynamic transmit policies that are asymptotically optimal in hindsight, irrespective of how the user’s environment 
and/or requirements evolve over time. 

Regret minimization was recently used in [23] to study the transient phase of the Foschini-Miljanic (FM) power 
control algorithm in static environments and to propose alternative convergent power control schemes based on the 
notion of swap regret [24]. In [25], the authors considered a potential game formulation for the joint power control 
and channel allocation problem in cognitive radio (CR) networks and they employed a regret minimizing algorithm 
[26] to reach a Nash equilibrium state. The same problem was also examined in the context of infrastructureless 
wireless networks by the authors of [27] who formulated the problem as a potential game and provided a power 
control algorithm based on internal regret minimization that converges to the game’s unique correlated - and, hence, 
Nash - equilibrium. Finally, in a very recent paper, the authors of [28] employed online optimization methodologies 
to derive a dynamic transmit policy for online rate maximization in cognitive radio networks, but without attempting 
to control the users’ radiated power level. 

Summary of results and paper outline 

In this paper, we focus on multi-user MIMO-OFDMA systems that evolve arbitrarily over time (for instance, due to 
fading, intermittent user activity, changing QoS requirements, etc.), and we seek to provide an efficient power control 
and allocation scheme that allows users to balance their radiated power against their achieved throughput “on the fly”, 
based only on locally available (and possibly imperfect) CSI. In particular, we formulate the wireless users’ power 
minimization/throughput maximization trade-off as an online optimization problem and we derive a no-regret power 
control policy based on the method of matrix exponential learning (MXL) [29-31]. The proposed MXL algorithm is
provably asymptotically optimal against the system’s evolution in hindsight; furthermore, it also enjoys the following 
desirable properties: 

• Distributedness: users update their own power profiles based only on local information.

• Asynchronicity: there is no need for a global update timer to synchronize user updates.

• Robustness: the algorithm retains its properties even under imperfect CSI.

• Statelessness: transmitters do not need to know the network’s state and/or topology.

This work builds on (and significantly extends) our recent results on the regret minimization properties of the 
original Foschini-Miljanic dynamics in single-input and single-output (SISO), single-carrier systems that evolve 
continuously over time [32]. Compared to [32], the current paper represents an extension to multi-carrier systems with
several antennas (at both the transmitter and the receiver) and with imperfect feedback and channel state information
at the transmitter (CSIT). 

After presenting our wireless system model in Section II, the proposed algorithm for adaptive power control in 
MIMO-OFDMA systems is derived in Section III. Our main result therein is that the proposed algorithm leads to 
no regret; in addition, we examine the algorithm’s behavior in the presence of imperfect CSI and we show that the 
algorithm retains its regret minimization properties almost surely, irrespective of the measurement noise level. Our 
theoretical analysis is supplemented by extensive numerical simulations in Section IV where we illustrate the power 
and throughput gains of the proposed power control algorithm under realistic network conditions. 


II. System Model and Problem Formulation 


Consider a set $\mathcal{U} = \{1, \dots, U\}$ of wireless point-to-point connections formed over a set of orthogonal subcarriers
$\mathcal{K} = \{1, \dots, K\}$; assume further that each connection $u \in \mathcal{U}$ comprises a transmit-receive pair $(t_u, r_u)$ with $M_u$ antennas
at the transmitter and $N_u$ antennas at the receiver. Thus, if $\mathbf{x}_k^u \in \mathbb{C}^{M_u}$ and $\mathbf{y}_k^u \in \mathbb{C}^{N_u}$ denote respectively the signals
transmitted and received over connection $u$ on subcarrier $k$, we obtain the familiar signal model:

\[ \mathbf{y}_k^u = \mathbf{H}_k^{uu} \mathbf{x}_k^u + \sum_{v \neq u} \mathbf{H}_k^{vu} \mathbf{x}_k^v + \mathbf{z}_k^u \tag{1} \]

where $\mathbf{z}_k^u \in \mathbb{C}^{N_u}$ denotes the ambient noise over subcarrier $k$ (including thermal, atmospheric and other peripheral
interference effects) and $\mathbf{H}_k^{vu} \in \mathbb{C}^{N_u \times M_v}$ is the transfer matrix between $t_v$ and $r_u$.

Unavoidably, the received signal $\mathbf{y}_k^u$ is affected by the ambient noise and interference due to the transmissions of
other connections on the same subcarrier, so we will write

\[ \mathbf{w}_k^u = \sum_{v \neq u} \mathbf{H}_k^{vu} \mathbf{x}_k^v + \mathbf{z}_k^u \tag{2} \]

for the multi-user interference-plus-noise (MUI) at the receiver $r_u$ of connection $u$ (for a schematic representation, see
Fig. 1); in this way, (1) attains the simpler form

\[ \mathbf{y}_k^u = \mathbf{H}_k^{uu} \mathbf{x}_k^u + \mathbf{w}_k^u. \tag{3} \]

In particular, in what follows, we will focus on a specific connection $u \in \mathcal{U}$, so, for clarity, we will drop the index $u$
altogether and we will write (3) even more compactly as:

\[ \mathbf{y}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{w}_k. \tag{4} \]


In this context, assuming Gaussian input and noise and single user decoding (SUD) at the receiver (i.e. the multi-user
interference by all other users is treated as additive noise), the transmission rate of the focal connection will be
[33, 34]:

\[ R(\mathbf{Q}) = \sum_{k \in \mathcal{K}} \left[ \log\det\left( \mathbf{W}_k + \mathbf{H}_k \mathbf{Q}_k \mathbf{H}_k^\dagger \right) - \log\det \mathbf{W}_k \right], \tag{5} \]

where $\mathbf{H}^\dagger$ denotes the Hermitian conjugate of $\mathbf{H}$ and:

• $\mathbf{Q}_k = \mathbb{E}[\mathbf{x}_k \mathbf{x}_k^\dagger]$ is the $M \times M$ covariance matrix of the transmitted signal over subcarrier $k$.

• $\mathbf{Q} = \operatorname{diag}(\mathbf{Q}_1, \dots, \mathbf{Q}_K)$ denotes the power profile of the focal transmitter over all subcarriers.

• $\mathbf{W}_k = \mathbb{E}[\mathbf{w}_k \mathbf{w}_k^\dagger]$ is the $N \times N$ MUI covariance matrix over subcarrier $k$.

Fig. 1. Example of a network with several active connections where we focus on a particular connection $u$ between transmitter $t_u$ and receiver $r_u$.
The other active connections $v \in \mathcal{U} \setminus \{u\}$ cause co-channel interference to the focal connection $u \in \mathcal{U}$ which, together with the ambient subcarrier
noise, is treated as additive colored noise.

In view of the above, let 

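\[ \tilde{\mathbf{H}}_k = \mathbf{W}_k^{-1/2} \mathbf{H}_k \tag{6} \]

denote the user’s effective channel matrix over subcarrier $k$. Then, Eq. (5) can be written as:

\[ R(\mathbf{Q}) = \sum_{k \in \mathcal{K}} \log\det\left( \mathbf{I} + \tilde{\mathbf{H}}_k \mathbf{Q}_k \tilde{\mathbf{H}}_k^\dagger \right) \tag{7} \]

or, even more concisely:

\[ R(\mathbf{Q}) = \log\det\left( \mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger \right), \tag{8} \]

where the block-diagonal matrix $\tilde{\mathbf{H}} = \operatorname{diag}(\tilde{\mathbf{H}}_1, \dots, \tilde{\mathbf{H}}_K)$ collects the user’s effective channel matrices over all
subcarriers $k \in \mathcal{K}$.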
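As a numerical sanity check of the passage from (5) to (7), the following minimal NumPy sketch evaluates the rate both ways on random data. The function names, the random test instance and the use of natural logarithms (rates in nats) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def inv_sqrt(W):
    """W^{-1/2} for a Hermitian positive-definite W, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(W)
    return (vecs / np.sqrt(vals)) @ vecs.conj().T

def rate_direct(H, Q, W):
    """Eq. (5): sum over subcarriers of log det(W_k + H_k Q_k H_k^†) - log det W_k."""
    total = 0.0
    for Hk, Qk, Wk in zip(H, Q, W):
        total += np.linalg.slogdet(Wk + Hk @ Qk @ Hk.conj().T)[1]
        total -= np.linalg.slogdet(Wk)[1]
    return total

def rate_effective(H, Q, W):
    """Eq. (7): the same rate through the effective channels W_k^{-1/2} H_k."""
    total = 0.0
    for Hk, Qk, Wk in zip(H, Q, W):
        H_eff = inv_sqrt(Wk) @ Hk
        n_rx = H_eff.shape[0]
        total += np.linalg.slogdet(np.eye(n_rx) + H_eff @ Qk @ H_eff.conj().T)[1]
    return total

def random_instance(seed=0, K=3, M=2, N=3):
    """Hypothetical test data: K subcarriers, M transmit and N receive antennas."""
    rng = np.random.default_rng(seed)
    cplx = lambda *shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    H = [cplx(N, M) for _ in range(K)]
    Q, W = [], []
    for _ in range(K):
        A = cplx(M, M)
        Q.append(A @ A.conj().T)              # positive-semidefinite signal covariance
        B = cplx(N, N)
        W.append(B @ B.conj().T + np.eye(N))  # positive-definite MUI covariance
    return H, Q, W

H, Q, W = random_instance()
print(rate_direct(H, Q, W), rate_effective(H, Q, W))  # the two values should agree
```

The agreement rests on the identity $\det(\mathbf{W} + \mathbf{H}\mathbf{Q}\mathbf{H}^\dagger) / \det \mathbf{W} = \det(\mathbf{I} + \mathbf{W}^{-1/2}\mathbf{H}\mathbf{Q}\mathbf{H}^\dagger\mathbf{W}^{-1/2})$, applied on each subcarrier.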

As we mentioned in the introduction, we focus on wireless users who seek to minimize their radiated power on the 
one hand while maximizing their transmission rate on the other. Thus, to account for this trade-off between transmit 
power and achieved throughput, we will consider the general power control objective: 

\[ \ell(\mathbf{Q}) = \operatorname{tr}[\mathbf{Q}] - \phi(R(\mathbf{Q})) \tag{9} \]

where $\phi \colon \mathbb{R}_+ \to \mathbb{R}$ is a nondecreasing function of the user’s achievable transmission rate $R(\mathbf{Q})$. By this token, $\ell(\mathbf{Q})$
can be interpreted as a “loss function” (or negative utility): higher values of $\ell(\mathbf{Q})$ indicate that the user is transmitting
at very high power, at very low rate, or both, so he is incurring a “loss”. Accordingly, we will only assume that
$\phi$ is Lipschitz continuous and concave: the former assumption is a mild technical requirement which we make for
simplicity, while the latter reflects the effects of “diminishing returns” on ever higher data rates (a rate increase from
1 bps to 2 bps is more impactful than an increase from 1,001 bps to 1,002 bps).

Remark 1. Utility-based formulations have a long history in the power control literature - see e.g. the recent papers 
[27, 35] for a related approach and [36, 37] for a similar formulation in terms of energy efficiency. Other possible 
approaches could involve achieving the Pareto frontier of the dual-objective trade-off between power minimization 
and throughput maximization; we focus on the specific model (9) on account of the model’s flexibility, generality and 
overall simplicity. 

Remark 2. An important special case of the objective (9) concerns the scenario where the focal user seeks to minimize
his transmit power $\operatorname{tr}[\mathbf{Q}]$ subject to achieving a target transmission rate $R^*$. This classical formulation of power control
can be recovered by considering a rate-adjustment function $\phi$ of the form $\phi(R) = \psi(R^* - R)$ with $\psi(r) = 0$ if $r \leq 0$ and
$\psi(r) < 0$ otherwise - for instance, a standard choice would be to take $\phi(R) = -\lambda \cdot [R^* - R]_+$ for some $\lambda > 0$. In this way,
when the target transmission rate is achieved (i.e. $R(\mathbf{Q}) \geq R^*$), the only term in the user’s loss function (9) is the user’s
total transmit power $\operatorname{tr}[\mathbf{Q}]$; otherwise, if the target transmission rate is not met, the user incurs an additional loss of at
least $|\psi'(0^+)| \cdot (R^* - R(\mathbf{Q}))$.¹ In this way, the (positive) factor $\lambda = |\psi'(0^+)|$ represents the tolerance of the connection with
respect to transmission rate deficits; smaller values of $\lambda$ correspond to softer rate requirements, while, in the large $\lambda$
limit, the loss function (9) stiffens to a hard constraint where no violations are tolerated.
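The rate-target special case of Remark 2 can be illustrated with a few lines of Python. This is only a sketch: the function names and the numerical values are hypothetical, and the rate is passed in as a number rather than computed from a channel.

```python
def phi_hinge(rate, target, lam):
    """Rate adjustment of Remark 2: zero once the target is met, -lam * deficit otherwise."""
    return -lam * max(target - rate, 0.0)

def loss(total_power, rate, target, lam):
    """Loss (9): total transmit power tr[Q] minus the rate adjustment phi(R)."""
    return total_power - phi_hinge(rate, target, lam)

# Target met: the loss is just the transmit power.
print(loss(2.0, 5.0, 4.0, 10.0))   # 2.0
# Target missed by one unit: the penalty lam * deficit is added on top.
print(loss(2.0, 3.0, 4.0, 10.0))   # 12.0
```

Increasing `lam` makes a given rate deficit more costly, which is the sense in which large values approach a hard rate constraint.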

In the above formulation, all sources of noise and co-channel interference by other users are collected in the effective
channel matrix of the focal connection; in this way, $\tilde{\mathbf{H}}$ collects all variables that are not under the direct control of
the focal transmitter/receiver pair. As such, given that we make no assumptions on the behavior of the other connections
in the network (or the evolution of the wireless medium itself), the matrix $\tilde{\mathbf{H}}$ may vary arbitrarily over time; our only
assumptions will be as follows:

(A1) $\tilde{\mathbf{H}}$ remains bounded for all time (e.g. due to RF circuit losses, antenna directivity, minimum distance between
transmitter and receiver, etc.).

(A2) The variability of $\tilde{\mathbf{H}}$ is such that standard results from information theory remain valid [33].

In this time-varying context, the throughput expression (5) becomes:

\[ R(\mathbf{Q}; t) = \log\det\left[ \mathbf{I} + \tilde{\mathbf{H}}(t) \mathbf{Q} \tilde{\mathbf{H}}^\dagger(t) \right], \tag{10} \]

where $\tilde{\mathbf{H}}(t)$ denotes the user’s effective channel matrix at time $t$. With this in mind, the user’s loss function at time $t$
will be

\[ \ell(\mathbf{Q}; t) = \operatorname{tr}[\mathbf{Q}] - \phi(R(\mathbf{Q}; t); t). \tag{11} \]

We thus obtain the following online power control problem for MIMO-OFDMA systems: 

\[
\begin{aligned}
\text{minimize} \quad & \ell(\mathbf{Q}; t), \\
\text{subject to} \quad & \mathbf{Q} \in \mathcal{X},
\end{aligned}
\tag{OPC}
\]

where

\[ \mathcal{X} = \{ \mathbf{Q} : \mathbf{Q} \succeq 0, \ \operatorname{tr}[\mathbf{Q}] \leq P \} \tag{12} \]

is the problem’s state space and $P > 0$ denotes the user’s maximum transmit power. More precisely, given that the user
has no control over the effective channel matrices $\tilde{\mathbf{H}}$, the sequence of events that we envision is as follows:

1) At each update epoch $n = 1, 2, \dots$, the user selects a transmit power profile $\mathbf{Q}(n) \in \mathcal{X}$.

2) The user’s loss $\ell(\mathbf{Q}(n); n)$ is determined by the state of the network and the behavior of all other users via the
effective channel matrices $\tilde{\mathbf{H}}(n)$ at the time of the user’s transmission.

3) The user selects a new transmit power profile $\mathbf{Q}(n+1) \in \mathcal{X}$ at stage $n + 1$ in an effort to minimize the a priori
unknown objective function $\ell(\mathbf{Q}; n+1)$ and the process repeats.

Needless to say, the key challenge in this dynamic framework is that the user does not know his objective function
$\ell(\mathbf{Q}; n)$ ahead of time, so he must try to somehow adapt to the changing network conditions “on the fly” (recall that
$\ell(\mathbf{Q}; n)$ depends at each stage $n$ on the evolution of the environment and the choices of all other users). As a result,
static solution concepts (such as Nash or correlated equilibria) are no longer relevant because, in general, there is no
optimum system state to target - either static or in the mean.

¹ Recall here that $\phi$ is assumed concave, so the user’s loss grows at least linearly with the rate deficit $R^* - R(\mathbf{Q})$.

Instead, given a time horizon $T$, we will compare the cumulative loss incurred by the user’s power profile $\mathbf{Q}(n)$ for
$n = 1, 2, \dots, T$, to the loss that the user would have incurred if he had chosen the best possible transmit profile in
hindsight; specifically, we define the user’s regret as:

\[ \operatorname{Reg}(T) = \max_{\mathbf{Q}^* \in \mathcal{X}} \sum_{n=1}^{T} \left[ \ell(\mathbf{Q}(n); n) - \ell(\mathbf{Q}^*; n) \right]. \tag{13} \]

The seminal notion of regret was first introduced in a game-theoretic setting by Hannan [21] and it has since given rise
to an extremely active field of research at the interface of optimization, statistics and machine learning - for a recent
survey, see e.g. [20, 22].² The user’s average regret is then defined as $T^{-1} \operatorname{Reg}(T)$ and the goal of regret minimization
is to devise a dynamic transmit policy $\mathbf{Q}(n)$ which is asymptotically optimal in hindsight, i.e. that leads to no regret:

\[ \limsup_{T \to \infty} \operatorname{Reg}(T)/T \leq 0, \quad \text{or, equivalently:} \quad \operatorname{Reg}(T) = o(T), \tag{14} \]

irrespective of how the objective function (9) evolves over time.
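To make the definition (13) concrete, the sketch below evaluates the regret in the simplest possible setting: a single carrier and a single antenna, so that the power profile is a scalar $q \in [0, P]$, with the hinge-type rate adjustment of Remark 2. All names, the scalar channel gains and the grid search over fixed comparators are illustrative assumptions; since the grid only covers finitely many comparators, the returned value is a lower estimate of the true regret.

```python
import math

def scalar_loss(q, gain, target, lam):
    """Scalar instance of the loss: power q plus lam times the rate deficit (if any)."""
    rate = math.log(1.0 + gain * q)
    return q + lam * max(target - rate, 0.0)

def regret_on_grid(played, gains, P, target, lam, grid_size=1001):
    """Cumulative loss of the played powers minus that of the best fixed power on a grid."""
    incurred = sum(scalar_loss(q, g, target, lam) for q, g in zip(played, gains))
    best_fixed = min(
        sum(scalar_loss(P * i / (grid_size - 1), g, target, lam) for g in gains)
        for i in range(grid_size)
    )
    return incurred - best_fixed

gains = [1.0, 1.0, 1.0]
target = math.log(2.0)          # met exactly at q = 1 when the gain is 1
print(regret_on_grid([1.0, 1.0, 1.0], gains, P=2.0, target=target, lam=10.0))  # 0.0
print(regret_on_grid([2.0, 2.0, 2.0], gains, P=2.0, target=target, lam=10.0))  # 3.0
```

In the second call the user always transmits at full power although half of it would have met the target, so he "regrets" one unit of power per stage.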

Remark 3. Importantly, if the user’s objective (9) does not vary with time (or if it varies in a stochastic fashion,
following some i.i.d. process), a no-regret policy converges to the problem’s static (or, respectively, average) solution
[20]. Furthermore, if the user could predict the solution of (OPC) ahead of every stage $n = 1, 2, \dots, T$ in an oracle-like
fashion, we would have $\operatorname{Reg}(T) \leq 0$ in (13) for all $T$; by this token, the no-regret requirement (14) is an indicator that
$\mathbf{Q}(n)$ tracks the optimum solution $\mathbf{Q}^*(n)$ of (OPC) as it evolves over time.³

² The terminology stems from the fact that large positive values of $\operatorname{Reg}(T)$ indicate that the user would have achieved a better power/rate trade-off
in the past by employing some fixed $\mathbf{Q}^*$ instead of $\mathbf{Q}(n)$, making him “regret” his choice.

³ In the machine learning literature, there exist more sophisticated notions of regret (such as adaptive [38] or shifting [39] regret) that further
quantify the quality of this tracking; due to space limitations however, we will focus our theoretical analysis almost exclusively on external regret
minimization which requires less technical language to describe.

III. Adaptive Power Control via Exponential Learning

In this section, we derive an adaptive power control algorithm for the online optimization problem (OPC) based on 
the method of matrix exponential learning (MXL) [29-31]. We first consider the case where the transmitter has access 
to perfect channel state information (CSI); the case of measurement errors and imperfect CSIT is then discussed in 
Sec. III-B. 


A. Learning with perfect CSI 

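A key element in our approach will be the gradient $\mathbf{V} = \nabla_{\mathbf{Q}} \ell$ of the user’s objective function (9). Specifically, if the
rate-adjustment function $\phi$ is smooth,⁴ we readily get:

\[ \mathbf{V} = \nabla_{\mathbf{Q}} \ell = \mathbf{I} - \phi'(R) \cdot \nabla_{\mathbf{Q}} R. \tag{15} \]

Some matrix calculus then yields:

\[ \nabla_{\mathbf{Q}} R = \tilde{\mathbf{H}}^\dagger \left[ \mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger \right]^{-1} \tilde{\mathbf{H}}, \tag{16} \]

so the gradient of $\ell(\mathbf{Q}(n); n)$ at $\mathbf{Q}(n)$ will be:

\[ \mathbf{V}(n) = \mathbf{I} - \phi'(R(\mathbf{Q}(n); n)) \cdot \tilde{\mathbf{H}}^\dagger(n) \left[ \mathbf{I} + \tilde{\mathbf{H}}(n) \mathbf{Q}(n) \tilde{\mathbf{H}}^\dagger(n) \right]^{-1} \tilde{\mathbf{H}}(n). \tag{17} \]

Since the effective channel matrices $\tilde{\mathbf{H}}(n)$ are assumed bounded, $\mathbf{V}(n)$ will also be bounded for all $n$; hence, we formally
assume that there exists a positive constant $V$ such that

\[ \|\mathbf{V}(n)\| \leq V, \tag{18} \]

where $\|\mathbf{V}\| = \lambda_{\max}(\mathbf{V})$ denotes the ordinary spectral norm (spectral radius) of $\mathbf{V}$.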
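The gradient formula (17) can be checked numerically. The sketch below is a minimal NumPy illustration under stated assumptions: the helper names are hypothetical, rates are in nats, and a linear rate adjustment $\phi(R) = 2R$ is used purely because it is smooth. It compares the directional derivative predicted by the gradient, $\operatorname{Re} \operatorname{tr}[\mathbf{V}\mathbf{D}]$, against a central finite difference of the loss along a Hermitian direction $\mathbf{D}$.

```python
import numpy as np

def rate(H_eff, Q):
    """R(Q) = log det(I + H Q H^†), in nats, as in (8)."""
    n_rx = H_eff.shape[0]
    return np.linalg.slogdet(np.eye(n_rx) + H_eff @ Q @ H_eff.conj().T)[1]

def loss(H_eff, Q, phi):
    """Loss (9): tr[Q] - phi(R(Q))."""
    return np.trace(Q).real - phi(rate(H_eff, Q))

def loss_gradient(H_eff, Q, dphi):
    """Gradient (17): V = I - phi'(R) * H^† (I + H Q H^†)^{-1} H."""
    n_rx, n_tx = H_eff.shape
    G = np.eye(n_rx) + H_eff @ Q @ H_eff.conj().T
    grad_rate = H_eff.conj().T @ np.linalg.solve(G, H_eff)
    return np.eye(n_tx) - dphi(rate(H_eff, Q)) * grad_rate

def directional_derivative(H_eff, Q, D, phi, eps=1e-6):
    """Central finite difference of the loss along the Hermitian direction D."""
    return (loss(H_eff, Q + eps * D, phi) - loss(H_eff, Q - eps * D, phi)) / (2 * eps)

rng = np.random.default_rng(0)
cplx = lambda *shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
H = cplx(3, 2)                       # hypothetical 3x2 effective channel
A = cplx(2, 2); Q = A @ A.conj().T   # positive-semidefinite power profile
B = cplx(2, 2); D = (B + B.conj().T) / 2
V = loss_gradient(H, Q, dphi=lambda R: 2.0)
print(np.trace(V @ D).real, directional_derivative(H, Q, D, phi=lambda R: 2.0 * R))
```

The two printed numbers should agree to several digits; the gradient itself is Hermitian, as expected for a function of a Hermitian argument.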

In view of the above, a first idea would be to update the user’s power profile $\mathbf{Q}(n)$ along the direction of steepest
descent indicated by $\mathbf{V}(n)$ [40]; however, this online gradient descent scheme would invariably violate the user’s
semidefiniteness constraint $\mathbf{Q} \succeq 0$, so it is not a viable transmit policy. Instead, inspired by the matrix regularization
methods of [29-31], we propose an algorithm that tracks the direction of steepest descent in a dual, unconstrained
space and then maps the result back to the problem’s state space via matrix exponentiation. More precisely, assuming
for the moment perfect CSIT, we will consider the matrix exponential learning scheme:

\[
\begin{aligned}
\mathbf{Y}(n) &= \mathbf{Y}(n-1) - \mathbf{V}(n), \\
\mathbf{Q}(n+1) &= P \, \frac{\exp\left(\eta n^{-1/2} \mathbf{Y}(n)\right)}{1 + \operatorname{tr}\left[\exp\left(\eta n^{-1/2} \mathbf{Y}(n)\right)\right]},
\end{aligned}
\tag{MXL}
\]

where $\eta > 0$ is a parameter that controls the user’s learning rate and the recursion is initialized with $\mathbf{Y}(0) = 0$.

The recursion (MXL) will be the main focus of our paper, so some remarks are in order (for an algorithmic 
implementation, see Alg. 1): 


Remark 1. Intuitively, the exponentiation step in (MXL) assigns more power to the spatial directions that perform well
while the $n^{-1/2}$ factor keeps the eigenvalues of $\mathbf{Q}(n)$ from approaching zero too fast (note that $\mathbf{Y}(n)$ grows as $O(n)$); the
trace normalization then ensures that $\mathbf{Q}(n)$ satisfies the feasibility constraints of (OPC) for all $n \geq 1$. In particular, as
we show in Appendix A, the recursion (MXL) can be seen as a “primal-dual” online mirror descent (OMD) method
[20] with a variable parameter [41]; for an in-depth discussion, see [20, 29-31, 41] and references therein.

⁴ In the general Lipschitz case, it suffices to replace $\phi'(R)$ by any element of $[\phi'(R^+), \phi'(R^-)]$.

Algorithm 1: Matrix exponential learning (MXL)

parameter: $\eta > 0$

/* Initialization */
$n \leftarrow 0$; $\mathbf{Y} \leftarrow 0$;
repeat
    $n \leftarrow n + 1$;
    /* Pre-transmission: set power */
    $\mathbf{Q} \leftarrow P \, \exp(\eta n^{-1/2} \mathbf{Y}) / \left(1 + \operatorname{tr}[\exp(\eta n^{-1/2} \mathbf{Y})]\right)$;
    /* Transmission */
    /* Post-transmission: measure rate and effective channel matrices */
    $R \leftarrow \log\det(\mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger)$;
    $\mathbf{V} \leftarrow \mathbf{I} - \phi'(R) \cdot \tilde{\mathbf{H}}^\dagger (\mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger)^{-1} \tilde{\mathbf{H}}$;
    $\mathbf{Y} \leftarrow \mathbf{Y} - \mathbf{V}$;
until transmission ends
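For readers who prefer executable code, the steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the effective channel is passed as one full matrix (the block-diagonal matrix collecting all subcarriers), the derivative of the rate adjustment is supplied by the caller as `dphi`, and the eigenvalue shift inside `mxl_power` is an added numerical safeguard that leaves the result mathematically unchanged.

```python
import numpy as np

def mxl_power(Y, n, eta, P):
    """Power step of Algorithm 1: Q = P exp(eta n^{-1/2} Y) / (1 + tr exp(eta n^{-1/2} Y)).

    Y is Hermitian, so the matrix exponential is taken through an eigendecomposition.
    Numerator and denominator are both scaled by exp(-shift) to avoid overflow.
    """
    vals, vecs = np.linalg.eigh(eta / np.sqrt(n) * Y)
    shift = max(vals.max(), 0.0)
    w = np.exp(vals - shift)
    E = (vecs * w) @ vecs.conj().T
    return P * E / (np.exp(-shift) + w.sum())

def run_mxl(channels, dphi, eta, P):
    """Run Algorithm 1 over a sequence of effective channel matrices (one per frame)."""
    m = channels[0].shape[1]
    Y = np.zeros((m, m), dtype=complex)
    history = []
    for n, H in enumerate(channels, start=1):
        Q = mxl_power(Y, n, eta, P)                     # pre-transmission: set power
        G = np.eye(H.shape[0]) + H @ Q @ H.conj().T
        R = np.linalg.slogdet(G)[1]                     # post-transmission: achieved rate
        V = np.eye(m) - dphi(R) * (H.conj().T @ np.linalg.solve(G, H))
        Y = Y - V                                       # dual (score) update
        history.append((Q, R))
    return history

rng = np.random.default_rng(0)
channels = [rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4)) for _ in range(50)]
history = run_mxl(channels, dphi=lambda R: 1.0, eta=0.5, P=2.0)
print(max(np.trace(Q).real for Q, _ in history))        # never exceeds P = 2.0
```

Every iterate is positive-semidefinite with trace below `P` by construction, which is the feasibility property discussed in Remark 1.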

Remark 2. From an implementation viewpoint, Algorithm 1 has the following desirable properties:

(P1) It is distributed: each transmitter updates his own power profile based only on local CSI.

(P2) It is asynchronous: the algorithm’s updates are event-based and can be performed without synchronization or any 
further signaling/coordination between connections. 

(P3) It is agnostic: transmitters do not need to know the status or geographical distribution of other connections in the 
network. 

(P4) It is reinforcing: each connection tends to minimize its individual loss. 

Remark 3. In terms of feedback, Algorithm 1 requires that a) transmitters measure their achieved rates; and b) the
receiver feeds back to the transmitter the received signal covariance $\mathbb{E}[\mathbf{y}\mathbf{y}^\dagger] = \mathbf{W} + \mathbf{H}\mathbf{Q}\mathbf{H}^\dagger$ (e.g. via broadcasting or
over a duplex downlink). From a computational standpoint, it is then easy to see that the complexity of each iteration
of Algorithm 1 is linear in the number of subcarriers $K$ and polynomial in the number of transmit antennas $M$: in
particular, since $\mathbf{Y}$ is block-diagonal, fast Coppersmith-Winograd matrix multiplication [42] provides a worst-case
complexity bound that is $O(K M^{2.373})$ per iteration.
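The linear dependence on the number of subcarriers comes from the block-diagonal structure: the matrix exponential can be taken one subcarrier at a time, with a single trace normalization shared by all blocks. The NumPy sketch below illustrates this under stated assumptions (hypothetical function name, blocks passed as a list of Hermitian matrices); note that it uses ordinary eigendecompositions, which are cubic in the block size, and not the fast matrix multiplication of [42].

```python
import numpy as np

def mxl_power_blocks(Y_blocks, n, eta, P):
    """Per-subcarrier power step: K small eigendecompositions, one shared normalization."""
    eig = [np.linalg.eigh(eta / np.sqrt(n) * Yk) for Yk in Y_blocks]
    shift = max(max(vals.max() for vals, _ in eig), 0.0)   # overflow guard
    weights = [np.exp(vals - shift) for vals, _ in eig]
    denom = np.exp(-shift) + sum(w.sum() for w in weights)
    return [P * ((vecs * w) @ vecs.conj().T) / denom
            for (_, vecs), w in zip(eig, weights)]

rng = np.random.default_rng(1)
blocks = []
for _ in range(3):                                         # K = 3 subcarriers, M = 2 antennas
    A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
    blocks.append(A + A.conj().T)
Q_blocks = mxl_power_blocks(blocks, n=4, eta=0.3, P=1.5)
print(sum(np.trace(Qk).real for Qk in Q_blocks))           # total power, below P = 1.5
```

The resulting blocks coincide with the diagonal blocks obtained by exponentiating the full block-diagonal matrix at once.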

⁵ We are implicitly assuming that $\phi'(R)$ can be calculated with very low cost - e.g. by means of a lookup table.

Our main theoretical result regarding the MXL power control algorithm (Alg. 1) is as follows:


Theorem 1. The MXL algorithm (Alg. 1) leads to no regret in the online power control problem (OPC). In particular,
the iterates of (MXL) enjoy the $O(T^{-1/2})$ regret bound:

\[ \frac{1}{T} \operatorname{Reg}(T) \leq \left( \frac{P \log(1 + KM)}{\eta} + \frac{\eta P V^2}{2} \right) \frac{1}{\sqrt{T}} + \frac{\eta P V^2}{4} \frac{1}{T}, \tag{19} \]


irrespective of how the system evolves over time. 


Proof: See Appendix B. ■ 

The bound (19) is our main performance guarantee for Alg. 1, so we proceed with a few remarks: 

Remark 4. Even though Theorem 1 focuses on a given connection $u \in \mathcal{U}$, the focal connection is still subject to
interference from other connections in the network (the incurred interference is captured by the effective channel
matrices $\tilde{\mathbf{H}}_k$ which depend on the interfering users’ transmit policies). In this light, Theorem 1 provides a worst-case
performance guarantee which holds even in the presence of malicious users (jammers) that seek to shut down the focal
connection.

On the other hand, a natural question that arises is whether users can meet more sophisticated criteria (such as 
reaching a globally efficient state or a Nash equilibrium) when they all follow the same algorithm and the wireless 
medium is otherwise static. In the MIMO multiple access channel (where all users transmit to a common receiver), 
it can be shown that the MXL algorithm leads to a socially optimum state; a more general treatment of this question 
(e.g. in the MIMO interference channel [10]) lies beyond the scope of this paper, so we delegate it to future work. 

Remark 5. We should also note here that the first term of the bound (19) captures the dimensionality of the problem
while the rest is an increasing function of the channel variability estimate $V$; as such, the learning parameter $\eta$ of
Algorithm 1 can be fine-tuned to accelerate the algorithm’s convergence to a no-regret state in terms of $V$. Specifically,
the value of $\eta$ which minimizes the dominant $O(T^{-1/2})$ term of the regret bound (19) for a fixed time horizon $T$ is:

\[ \eta = V^{-1} \sqrt{2 \log(1 + KM)}. \tag{20} \]

In turn, this parameter choice leads to the optimized convergence rate:

\[ \frac{1}{T} \operatorname{Reg}(T) \lesssim P V \sqrt{\frac{2 \log(1 + KM)}{T}}. \tag{21} \]

The $O(1/\sqrt{T})$ dependence of (21) is known to be asymptotically tight in the context of online optimization problems
against an adversarial nature [20], while the $O(\log KM)$ behavior represents a significant reduction in the dimensionality
of the problem (which has $O(KM)$ degrees of freedom). In fact, (21) becomes tight only in adversarial environments
(e.g. induced by jamming), so, in practical situations, the user’s regret minimization rate is considerably faster - cf.
Section IV.
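The tuning rule (20) is easy to verify numerically. The sketch below assumes that the bound (19) has the form $\left(P \log(1+KM)/\eta + \eta P V^2/2\right)/\sqrt{T} + \eta P V^2/(4T)$; the function names and the numerical values are illustrative only.

```python
import math

def regret_bound(eta, T, P, V, K, M):
    """Right-hand side of (19), in the form assumed in the lead-in above."""
    lead = P * math.log(1 + K * M) / eta + eta * P * V ** 2 / 2
    return lead / math.sqrt(T) + eta * P * V ** 2 / (4 * T)

def tuned_eta(V, K, M):
    """Learning rate (20): minimizes the coefficient of the 1/sqrt(T) term."""
    return math.sqrt(2 * math.log(1 + K * M)) / V

P, V, K, M, T = 1.0, 3.0, 8, 2, 10_000
eta_star = tuned_eta(V, K, M)
for scale in (0.5, 1.0, 2.0):
    print(scale, regret_bound(scale * eta_star, T, P, V, K, M))
# The leading term at eta_star equals P * V * sqrt(2 * log(1 + K * M) / T), as in (21).
print(P * V * math.sqrt(2 * math.log(1 + K * M) / T))
```

Halving or doubling the tuned learning rate increases the leading term, while the remaining $1/T$ term is negligible for long horizons.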

Remark 6. The agnostic initialization $\mathbf{Y}(0) = 0$ is a conservative choice reflecting the worst-case scenario where the
user assumes bad channel conditions. Indeed, $\mathbf{Y}(0) = 0$ corresponds to initial transmit power equal to $P \cdot KM/(1 + KM) \approx P$
in the large $K$ (or large $M$) limit; in this way, the user’s transmit power will likely be reduced under
Algorithm 1 in the presence of good channel conditions. Hence, if the transmitter has some estimate of his expected 
channel conditions, it would be preferable to initialize power accordingly: if the user expects a good channel, initial 
power should be set lower (to save battery life); otherwise, if a bad channel is expected, initial transmit power should 
be set high so as to avoid very low transmission rates in the first few frames. 
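One simple way to act on this remark is to start from a scalar multiple of the identity instead of zero. The sketch below is an assumption of ours rather than a procedure from the paper: it takes the initial score matrix to be $c$ times the identity, uses the convention of Algorithm 1 that the first power profile is computed with $n = 1$, and solves for the offset $c$ that yields a desired initial total power. The function names are hypothetical.

```python
import math

def initial_power(c, eta, P, K, M):
    """Total initial power when the score matrix starts at c * I (c = 0 is the agnostic choice)."""
    x = K * M * math.exp(eta * c)
    return P * x / (1.0 + x)

def offset_for_power(p0, eta, P, K, M):
    """Offset c such that starting from c * I gives total initial power p0, with 0 < p0 < P."""
    return math.log(p0 / ((P - p0) * K * M)) / eta

P, K, M, eta = 2.0, 8, 2, 0.5
print(initial_power(0.0, eta, P, K, M))    # P * KM / (1 + KM), i.e. 32/17, close to P
c = offset_for_power(0.5, eta, P, K, M)    # a user expecting a good channel starts low
print(c, initial_power(c, eta, P, K, M))   # negative offset, initial power 0.5
```

A negative offset lowers the initial power (good channel expected) and a positive one pushes it closer to the maximum.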


B. Adaptive power control with imperfect CSI

In practice, a major challenge occurs if the transmitters do not have access to perfect CSI with which to update the 
adaptive power control scheme (MXL). In particular, given that each user’s gradient matrix V is determined by his 
effective channel matrix H, imperfect measurements of the users’ channel or the multi-user interference-plus-noise 
(due e.g. to pilot contamination, undersampling or other factors) could have a catastrophic effect on the no-regret 
properties of the proposed scheme (MXL). Accordingly, our goal in this section will be to examine the robustness of 
(MXL) in the presence of measurement errors and observation noise. 

To model errors of this kind, we assume that, at each update period $n = 1, 2, \dots$, the transmitter observes a noisy
estimate $\hat{\mathbf{V}}(n)$ of the form

\[ \hat{\mathbf{V}}(n) = \mathbf{V}(n) + \mathbf{Z}(n), \tag{22} \]

where the error process $\mathbf{Z}(n) = \operatorname{diag}(\mathbf{Z}_1(n), \dots, \mathbf{Z}_K(n))$ satisfies the statistical hypotheses:

(H1) Unbiasedness:

\[ \mathbb{E}[\mathbf{Z}(n) \mid \mathbf{Q}(n-1)] = 0. \tag{H1} \]

(H2) Tame tails:

\[ \mathbb{P}(\|\mathbf{Z}(n)\| \geq z) \leq A / z^{\alpha} \quad \text{for some } A > 0 \text{ and for some } \alpha > 4. \tag{H2} \]

The unbiasedness assumption (H1) is a bare-bones assumption which simply boils down to asking that there is no
biased, systematic error in the user’s CSI measurements. Likewise, (H2) posits a fairly mild control on the probability
of observing very high errors, and is satisfied by the vast majority of statistical error distributions (including for
instance uniformly distributed, Gaussian, log-normal, Weibull and Lévy-type error processes); in particular, we do not
assume that the measurement errors $\mathbf{Z}(n)$ are i.i.d., state-independent, or even a.s. bounded.
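As an illustration of the observation model (22), the sketch below perturbs a gradient matrix with zero-mean Hermitian Gaussian noise, which is one of the error distributions listed above. The helper names and the noise level `sigma` are hypothetical, and the i.i.d. Gaussian choice is made only for simplicity; the hypotheses themselves allow far more general error processes.

```python
import numpy as np

def hermitian_noise(rng, m, sigma):
    """Zero-mean Hermitian Gaussian perturbation, so that the estimate is unbiased."""
    A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    return sigma * (A + A.conj().T) / 2

def noisy_gradient(V, rng, sigma):
    """Observation model (22): noisy estimate = true gradient + measurement error."""
    return V + hermitian_noise(rng, V.shape[0], sigma)

rng = np.random.default_rng(0)
V = np.eye(2, dtype=complex)                       # stand-in for a true gradient matrix
samples = [noisy_gradient(V, rng, sigma=1.0) for _ in range(20000)]
print(np.abs(np.mean(samples, axis=0) - V).max())  # small: the errors average out
```

Feeding such estimates to the score update in place of the exact gradient gives the imperfect-CSI variant of the algorithm studied in this section.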

Importantly, under these mild hypotheses for the statistics of the measurement noise, we have: 


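Theorem 2. The MXL algorithm (Alg. 1) run with imperfect observations satisfying (H1) and (H2) leads to no regret
(a.s.); in particular, it enjoys the mean regret bound:

\[ \mathbb{E}\left[ \frac{1}{T} \operatorname{Reg}(T) \right] \leq \left( \frac{P \log(1 + KM)}{\eta} + \frac{\eta P \hat{V}^2}{2} \right) \frac{1}{\sqrt{T}} + \frac{\eta P \hat{V}^2}{4} \frac{1}{T}, \tag{23} \]

where $\hat{V}^2 = \sup_n \mathbb{E}\left[ \|\hat{\mathbf{V}}(n)\|^2 \mid \mathbf{Q}(n-1) \right]$.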


Proof: See Appendix C. ■ 

Remark 7. From an implementation perspective, we should note here that the mean bound (23) reduces to the
deterministic bound (19) in the case of perfect CSI. Also, even though we have $\limsup_{T \to \infty} T^{-1} \operatorname{Reg}(T) \leq 0$ (a.s.),
the realized regret of Alg. 1 may exceed the mean bound (23) with positive probability. By a concentration inequality
argument [43], it is possible to estimate analytically the probability of such deviations in terms of the central moments
of the error process, but this analysis would take us too far afield so we do not present it here.

Remark 8. Hypothesis (H2) implies that the error process Z has finite (central) moments of up to fourth order - in 
fact, barring pathological examples, this requirement is essentially tantamount to (H2). The importance of fourth order 
moments has to do with the fact that we are using a variable learning parameter that decays as $n^{-1/2}$; by choosing a
slower decay rate of the form $n^{-\gamma}$ for some $\gamma \in (0, 1/2)$, it is possible to relax Hypothesis (H2) down to second order
moment control. However, given that (H2) already suffices for the framework at hand (and due to space limitations), 
we do not present this more general analysis here. 
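As a rough numerical illustration of hypotheses (H1) and (H2), the sketch below draws Hermitian error matrices with Student-t entries. The choice of distribution, the value `DF = 5`, the scale factor and the helper name `sample_noise` are assumptions made for this example only and do not come from the analysis above: a Student-t variable with $\nu$ degrees of freedom is zero-mean with tails decaying like $z^{-\nu}$, so $\nu = 5$ is compatible with the requirement $\alpha > 4$ while remaining unbounded.

```python
import numpy as np

DF = 5  # assumed degrees of freedom; plays the role of the tail exponent alpha > 4 in (H2)

def sample_noise(rng, dim, df=DF, scale=0.1):
    """Draw one Hermitian error matrix Z with zero-mean Student-t entries."""
    re = rng.standard_t(df, size=(dim, dim))
    im = rng.standard_t(df, size=(dim, dim))
    g = scale * (re + 1j * im)
    return (g + g.conj().T) / 2  # Hermitian part, still zero-mean

rng = np.random.default_rng(0)
samples = np.stack([sample_noise(rng, 4) for _ in range(20000)])

# Spectral norms ||Z(n)|| of every sample.
norms = np.linalg.norm(samples, ord=2, axis=(1, 2))

# (H1): the entrywise sample mean should be close to zero.
mean_error = np.abs(samples.mean(axis=0)).max()

# Remark 8: the empirical fourth moment of ||Z(n)|| stays finite.
fourth_moment = np.mean(norms ** 4)
```

Lowering `DF` to 4 or below would violate the tail condition of (H2) as stated, which is the regime where Remark 8 suggests a slower step-size decay.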

IV. Numerical Results 

To validate the theoretical analysis of Section III, we conducted extensive numerical simulations over a wide range 
of design parameters and specifications. In what follows, we present a representative subset of these results, but the 
conclusions drawn remain valid in most typical mobile wireless environments. 

Throughout this section, we consider a typical cellular OFDMA wireless network that occupies a 10 MHz band
divided into 1024 subcarriers around a central frequency $f_c = 2.5$ GHz. We further assume that each cell employs a
simple randomized access algorithm [44] to allocate subcarriers to the users it serves. In the following, we focus on
$U = 4$ users that are located at different cells - served by different base stations (BSs) - and that have been allocated
the same set of $K = 8$ subcarriers. We focus on the uplink (UL) case, so the receivers are assumed stationary whereas
the transmitters may be either stationary or mobile, depending on the simulated scenario. Communication occurs over
a time-division duplexing (TDD) scheme with frame duration $T_f = 5$ ms: specifically, transmission occurs during the
UL subframe while receivers process the transmitted signal and provide feedback during the downlink (DL) subframe;
upon reception of the feedback, transmitters update their transmit powers according to Algorithm 1, and the process
repeats until transmission ends. For demonstration purposes, we simulated the case where each connection has a
fixed rate requirement $R^*$ which varies across connections $u \in \mathcal{U}$ so as to ensure diversity of QoS requirements (the
users’ tolerance and loss function is defined as indicated in Remark 2). For convenience, all simulation parameters are
summarized in Table I.
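To give a feel for the scales implied by the parameters of Table I, the following sketch computes the receiver noise floor per subcarrier and over the $K = 8$ shared subcarriers. It assumes the usual link-budget convention (noise floor in dBm = spectral density + 10 log10 of the bandwidth in Hz + noise figure); this convention and the helper name `noise_power_dbm` are assumptions of the example and are not stated in the text.

```python
import math

N0_DBM_PER_HZ = -174.0               # AWGN spectral power density (Table I)
SUBCARRIER_SPACING_HZ = 10.9375e3    # subcarrier spacing (Table I)
NOISE_FIGURE_DB = 7.0                # receiver noise figure (Table I)
K = 8                                # subcarriers shared by the focal users

def noise_power_dbm(bandwidth_hz, n0=N0_DBM_PER_HZ, nf=NOISE_FIGURE_DB):
    """Noise floor in dBm over the given bandwidth (assumed link-budget rule)."""
    return n0 + 10.0 * math.log10(bandwidth_hz) + nf

per_subcarrier = noise_power_dbm(SUBCARRIER_SPACING_HZ)
over_k_subcarriers = noise_power_dbm(K * SUBCARRIER_SPACING_HZ)

print(f"noise per subcarrier: {per_subcarrier:.2f} dBm")
print(f"noise over K = {K} subcarriers: {over_k_subcarriers:.2f} dBm")
```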

TABLE I
OFDMA Network Simulation Parameters

Parameter                          Value
Number of Cells
Cell Radius                        1 km
Central Frequency                  2.5 GHz
Available Bandwidth                10 MHz
Number of OFDM Subcarriers         1024
Subcarrier Spacing                 10.9375 kHz
Transmit Antennas                  2
Receive Antennas                   2
Propagation Model                  COST-Hata-Model
BS Antenna Height                  32 m
MS Antenna Height                  1.5 m
Shadowing                          8.9 dB
AWGN Spectral Power Density        N_0 = -174 dBm/Hz
Receiver Noise Figure              7 dB
Frame Duration                     5 ms
Requested Bit Rate per User        {764.6, 113.7, 909.3, 1081.3} kbps
Maximum Transmit Power per User    {40.40, 41.10, 42.85, 45.58} dBm

For benchmarking purposes, the first simulated scenario focuses on the case where channels remain static during the
transmission horizon. In Fig. (2a), we plot the evolution of the users’ objective $\ell(Q; n)$ under Algorithm 1: as can be
seen, users quickly reach an optimal state corresponding to the minimum of their loss function (i.e. minimum transmit
power subject to the users’ rate requirements). In particular, as we see in Fig. (2b), even though all connections start
with excessive transmit power (due to the algorithm’s conservative initialization), they converge within 2 dB of their
optimum transmit profile within a few frames (between 5 and 15, depending on the connection). Interestingly, we also
see some slight power oscillations (of the order of 1 dB) that persist for a few frames after the initial ones: these are
due to small violations of the users’ rate requirements (due to the power updates of other users) that cause them to
momentarily increase their transmit power. Similar oscillations are observed with respect to the achieved/target rate
gap $R(Q; n)/R^*$ depicted in Fig. (2c): users quickly get within 2-5% of their target value, but they oscillate slightly for
a few frames before converging. Finally, in Fig. (2d), we plot the users’ average regret $T^{-1}\operatorname{Reg}(T)$ (solid lines) along
with the theoretical bound predicted by Theorem 1 (dashed lines). To increase resolution, we plot the users’ regret
in a logarithmic scale: in this way, the observed vertical drops to $-\infty$ correspond to the point where the users’ regret
becomes negative (an indication of the number of frames required for the algorithm to converge). In tune with the
above observations, we see that users only require a few frames to achieve a no-regret state.
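The way the regret curves are displayed (running average, positive part, logarithmic dB scale) can be sketched as follows. This is a minimal illustration under stated assumptions: `benchmark_losses` stands for the per-frame losses of one fixed comparator $Q^*$ (finding the best comparator in hindsight is outside the scope of the sketch), and the function names and toy numbers are hypothetical.

```python
import numpy as np

def average_regret(losses, benchmark_losses):
    """Running average T^{-1} * sum_{n<=T} [loss(n) - benchmark(n)]."""
    diff = np.asarray(losses, dtype=float) - np.asarray(benchmark_losses, dtype=float)
    horizon = np.arange(1, diff.size + 1)
    return np.cumsum(diff) / horizon

def positive_part_db(values):
    """10*log10 of the positive part; non-positive entries are mapped to -inf."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, -np.inf)
    pos = values > 0
    out[pos] = 10.0 * np.log10(values[pos])
    return out

# Toy loss streams: the algorithm starts badly, then beats the comparator.
avg = average_regret([3.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0])
curve_db = positive_part_db(avg)
```

In this toy stream the average regret is positive only in the first frame, so the dB curve drops to $-\infty$ from the second frame onwards, which is the kind of vertical drop described above.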

The second simulated scenario examines the case of imperfect CSI. Specifically, in Fig. 3, we consider the same 
network realization as in Fig. 2, but we no longer assume that transmitters receive perfect CSI during the TDD feedback 
loop; instead, we assume imperfect channel state measurements and we plot the users’ power, rate and regret under 
Algorithm 1 with noisy observations. In particular, the transmitters’ CSI deviates from its corresponding mean value 
with standard deviation of 0.50V for connections 1 and 3, and 1.00V for connections 2 and 4, respectively. Due to 
this huge uncertainty, users are more conservative and tend to use up more power to achieve their rate requirements; 
however, after an initial sampling period (lasting a few tens of frames), they confidently reduce power and converge to 
an optimum rate/power trade-off (as evidenced by the minimization of their objective). A similar behavior is observed
in Fig. (3b) which shows the evolution of the users’ throughput over time: even though there are more pronounced 
fluctuations over the first few frames, all connections eventually converge to their target rates. The main performance 
degradation is in the algorithm’s convergence time: as can be seen in Fig. 3c, Algorithm 1 takes longer to converge to 
a no-regret state, chiefly due to the regret generated during the algorithm’s training phase. 

[Fig. 2: four panels plotted against time (ms): (a) Loss; (b) Power; (c) Rate; (d) Average Regret.]

Fig. 2. Adaptive MIMO-OFDM power control under total power constraints for $U = 4$ connections with static channel conditions. Fig. 2a depicts
the evolution of the users’ objective function $\ell(Q; n)$ under the online power control algorithm (MXL). Similarly, Fig. 2b shows the evolution of the
users’ total transmit power $\operatorname{tr}[Q(n)]$ (solid lines; dashed lines correspond to the users’ maximum transmit power); Fig. 2c shows the achieved/target
rate gap $R(Q; n)/R^*$. Finally, the average regret $n^{-1}\operatorname{Reg}(Q^*; n)$ is plotted in Fig. 2d (solid lines), along with the theoretical bounds predicted by
Theorem 1 (dashed lines); for simplicity, we only plot the positive part of the regret and we use a logarithmic dB scale for consistency.

Finally, in Fig. 4, we simulate a realistic time-varying environment where the focal transmitters move at different
speeds. For simulation purposes, we used the extended pedestrian A (EPA), extended vehicular A (EVA), and extended
typical urban (ETU) channel models for pedestrian (2 km/h), urban vehicular (30 km/h) and high speed (130 km/h)
users respectively [45]. To illustrate the variability of the users’ channels, we plot their evolving channel gains
$\operatorname{tr}[H(n)H^{\dagger}(n)]$ in Fig. (4a): as can be seen, channel variations are quite wide and become more pronounced for higher
user velocities.

[Fig. 3: three panels: (a) Power; (b) Rate; (c) Average Regret.]

Fig. 3. Adaptive MIMO-OFDM power control for $U = 4$ connections with static channel conditions but imperfect CSIT (relative measurement
errors depicted in each figure’s legend). Figs. 3a and 3b respectively depict the evolution of the users’ radiated power $\operatorname{tr}[Q(n)]$ and achieved/target
rate gap $R(Q; n)/R^*$ under the online power control algorithm (MXL). The users’ average regret $n^{-1}\operatorname{Reg}(Q^*; n)$ is plotted in Fig. 3c (solid lines),
along with the theoretical bounds predicted by Theorem 1 (dashed lines); for simplicity, we only plot the positive part of the regret and we use a
logarithmic dB scale for consistency.

In this dynamic setting, the main challenge for the users is to track the optimum signal covariance profile that
balances their transmit power against their achieved throughput (i.e. that minimizes their loss) as this optimum profile
evolves over time. To that end, Fig. 4b shows that the users’ radiated power under Algorithm 1 increases (to compensate
for poor channel conditions) or decreases (when channel conditions are more favorable) in a way consistent with the
evolution of the wireless medium (Fig. 4a). Dually, in Fig. 4c we plot the time average of the users’ achieved/target
rate ratio:⁶ as can be seen, users consistently achieve their target throughput, and their achieved/target throughput ratio
converges to 1 over time (in practice, within a few frames for users that do not move at very high speeds). Furthermore,
we see that connections with a softer tolerance for the satisfaction of their QoS requirements (e.g. Connection 1) are
very aggressive in reducing transmit power when channel conditions seem to allow it, whereas connections that are
less tolerant with respect to their QoS requirements (e.g. Connection 2) are more conservative and transmit at relatively
high powers (resulting in higher rates) as a precaution against deep fading events.

⁶ Time-averages are considered in order to weed out stochastic fluctuations (due to the users’ changing fading environment) that could be
potentially misleading.

[Fig. 4: four panels plotted against time (ms): (a) Channel; (b) Power; (c) Average Rate; (d) Average Regret.]

Fig. 4. Adaptive MIMO-OFDM power control for $U = 4$ connections with time-varying channel conditions corresponding to stationary receivers
and mobile transmitters (average speed as in each figure’s legend). For reference purposes, Fig. 4a depicts the evolution of the channel gains
$\operatorname{tr}[H(t)H^{\dagger}(t)]$ over time. Fig. 4b shows the evolution of the users’ total transmit power under Algorithm 1 (dashed lines represent the users’
maximum transmit power), while Fig. 4c shows the achieved/target rate gap $R(t)/R^*$. Finally, as in the static channel case, Fig. 4d shows the
users’ average regret $\operatorname{Reg}(T)/T$: as predicted by Theorem 1, the users’ regret quickly becomes negative, indicating that their transmit policy is
asymptotically optimal in hindsight (for simplicity, we only plot the positive part of the regret and we use a logarithmic dB scale for consistency).

Finally, as in the static channel case, Fig. 4d depicts the users’ average regret over time: again, despite the pessimistic
high-power initialization of Algorithm 1, the users’ regret drops to the no-regret regime in just a few frames (much
faster than the $O(1/\sqrt{T})$ bounds predicted by Theorem 1). The reason for this faster convergence is that the worst-case
bounds of Theorem 1 only become relevant under very adverse (or adversarial) environments, occurring for example
when users are being jammed by a third party: in standard mobility scenarios (such as the one simulated here), the
evolution of the wireless medium is relatively tame from a statistical perspective, so users adapt to its variability much
faster than in the adversarial regime.


V. Conclusions 

In this paper, we examined the trade-off between radiated power and achieved throughput in wireless MIMO-OFDMA 
systems that evolve dynamically over time as the result of changing channel conditions and user QoS requirements. 
To account for the system’s complete lack of stationarity (or any other type of average behavior that could allow the 
use of traditional solution concepts such as Nash/correlated equilibria), we provided a formulation based on online 
optimization and we derived an adaptive matrix exponential learning algorithm that leads to no regret - i.e. that 
is asymptotically optimal in hindsight, irrespective of how the wireless system varies with time. Importantly, the 
proposed algorithm requires only local CSIT and is robust with respect to measurement errors and imperfections: 
in particular, under fairly mild hypotheses for the uncertainty statistics, the proposed algorithm retains its regret 
minimization properties and converges to a no-regret state. As a result, thanks to the algorithm’s no regret property, 
the system’s users are able to track their optimal transmit power profile “on the fly”, even under randomly changing 
channel conditions and high uncertainty. 

The proposed algorithmic framework can be readily extended to different precoding schemes (such as MMSE or ZF- 
type precoders), or to account for other transmission features such as spectral mask constraints, pricing, etc. Through 
judicious use of convexification techniques, it can also be applied to non-convex energy-efficiency objectives, such as 
the users’ achieved throughput per Watt of radiated power; we intend to explore these directions in future work. 

Appendix 
Technical Proofs 

Our goal in this appendix is to prove the regret guarantees of (MXL) under both perfect and imperfect CSI 
(Theorems 1 and 2 respectively). Drawing on the approach of [41, 46], we will first establish the no-regret properties 
of Algorithm 1 in a continuous-time, “mean-field” setting, and we will then show that these properties descend to 
discrete time at the cost of an extra term in the algorithm’s regret guarantees. The algorithm’s robustness properties 
with respect to measurement noise and errors will then follow by using the theory of concentration inequalities. 

For notational clarity and convenience, we will be suppressing the dependence on time whenever possible, and we
will write e.g. $\dot Q$ instead of $\frac{d}{dt} Q(t)$ when there is no ambiguity.

A. No regret in continuous time

We begin by considering the following continuous-time analogue of the basic power control algorithm (MXL):

$$\dot Y = -V, \qquad Q = P\,\frac{\exp(\eta(t) Y)}{1 + \operatorname{tr}[\exp(\eta(t) Y)]}, \tag{MXL-c}$$

where $\eta(t) > 0$ is a smooth, nonincreasing learning parameter and the gradient matrix $V$ is defined as in (17). The
following proposition shows that (MXL-c) leads to no regret in continuous time:

Proposition 3. The learning scheme (MXL-c) guarantees the continuous-time regret bound:

$$\max_{Q^* \in \mathcal{X}} \int_0^T \left[\ell(Q(t); t) - \ell(Q^*; t)\right] dt \leq \frac{P \log(1 + KM)}{\eta(T)}, \tag{24}$$

for any measurable stream of effective channel matrices $H(t)$, $t \geq 0$. In particular, if $\eta(t)$ satisfies the decay rate
condition $\lim_{t\to\infty} t \cdot \eta(t) = \infty$, the learning scheme (MXL-c) leads to no regret.

Proof: We first note that the loss function $\ell(Q; t)$ is convex with respect to $Q$ (to see this, simply recall that the
Shannon rate function $R(Q; t)$ is concave in $Q$ [47] while $\phi$ is assumed concave and nondecreasing). With this in mind,
we obtain:

$$\ell(Q(t); t) - \ell(Q^*; t) \leq \operatorname{tr}\left[(Q(t) - Q^*) \cdot V(t)\right], \tag{25}$$

where $V(t) = \nabla_Q \ell(Q(t); t)$ denotes the gradient of $\ell(\cdot\,; t)$ evaluated at $Q(t)$. Accordingly, to establish the no-regret
bound (24) for (MXL), it suffices to show that

$$\int_0^T \operatorname{tr}\left[(Q(t) - Q^*) \cdot V(t)\right] dt \leq \frac{P \log(1 + KM)}{\eta(T)} \tag{26}$$

for all $Q^* \in \mathcal{X}$.

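To that end, (MXL) readily yields:

$$\int_0^T \operatorname{tr}\left[(Q(t) - Q^*) \cdot V(t)\right] dt = \int_0^T \operatorname{tr}\left[(Q^* - Q(t)) \cdot \dot Y(t)\right] dt = \operatorname{tr}\left[Y(T) \cdot Q^*\right] - \int_0^T \operatorname{tr}\left[Q(t) \dot Y(t)\right] dt, \tag{27}$$

where we have used the fact that $Y(0) = 0$. To continue, note that the exponentiation step of (MXL) can be written
more simply as:

$$Q = P \cdot \nabla_U \log\left[1 + \operatorname{tr}\exp(U)\right], \tag{28}$$

where we have set $U = \eta Y$.⁷ As a result, with $\dot U = \dot\eta Y + \eta \dot Y$, the integrand of the second term of (27) becomes:

$$\operatorname{tr}[Q \dot Y] = \frac{1}{\eta}\operatorname{tr}[Q \dot U] - \frac{\dot\eta}{\eta^2}\operatorname{tr}[Q U] = \frac{P}{\eta}\frac{d}{dt}\log\left[1 + \operatorname{tr}\exp(U)\right] - \frac{\dot\eta}{\eta^2}\operatorname{tr}[Q U]. \tag{29}$$

⁷ This is actually one of the main reasons behind the exponentiation step of (MXL).

Hence, after integrating (29) by parts (and recalling that $U(0) = 0$), we get:

$$\begin{aligned}
\int_0^T \operatorname{tr}\left[Q(t)\dot Y(t)\right] dt
&= \left.\frac{P}{\eta(t)}\log\left[1 + \operatorname{tr}\exp(U(t))\right]\right|_0^T
 + P\int_0^T \frac{\dot\eta(t)}{\eta(t)^2}\log\left[1 + \operatorname{tr}\exp(U(t))\right] dt
 - \int_0^T \frac{\dot\eta(t)}{\eta(t)^2}\operatorname{tr}\left[Q(t)U(t)\right] dt \\
&= \frac{P\log\left[1 + \operatorname{tr}\exp(U(T))\right]}{\eta(T)} - \frac{P\log(1 + KM)}{\eta(0)}
 + \int_0^T \frac{\dot\eta(t)}{\eta(t)^2}\Big[P\log\left[1 + \operatorname{tr}\exp(U(t))\right] - \operatorname{tr}\left[Q(t)U(t)\right]\Big] dt,
\end{aligned} \tag{30}$$

where we have used the fact that $U(0) = 0$ (implying in turn that $\operatorname{tr}\exp(U(0)) = KM$). Thus, combining all of the
above, we obtain:

$$\begin{aligned}
\int_0^T \operatorname{tr}\left[(Q(t) - Q^*) \cdot V(t)\right] dt
&= \frac{P\log(1 + KM)}{\eta(0)} + \frac{\operatorname{tr}\left[Q^* U(T)\right] - P\log\left[1 + \operatorname{tr}\exp(U(T))\right]}{\eta(T)} \\
&\quad + \int_0^T \frac{\dot\eta(t)}{\eta(t)^2}\Big[\operatorname{tr}\left[Q(t)U(t)\right] - P\log\left[1 + \operatorname{tr}\exp(U(t))\right]\Big] dt.
\end{aligned} \tag{31}$$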

To proceed, we will require the inequality:

$$\operatorname{tr}[AX] - \log\left[1 + \operatorname{tr}\exp(X)\right] \leq \operatorname{tr}[A\log A] + (1 - \operatorname{tr}A)\log(1 - \operatorname{tr}A), \tag{32}$$

valid for all Hermitian $A, X$ with $A \succeq 0$, $\operatorname{tr}A \leq 1$, and with equality holding if and only if

$$A = \frac{\exp(X)}{1 + \operatorname{tr}\exp(X)}. \tag{33}$$

To establish (32), it clearly suffices to show that the supremum of its LHS for fixed $A$ is precisely the RHS of (32).
Accordingly, let

$$F(X) = \operatorname{tr}[AX] - \log\left[1 + \operatorname{tr}\exp(X)\right], \tag{34}$$

so the maximizers of the LHS of (32) are given by the first-order stationarity condition $\nabla_X F(X) = 0$ (simply note that
$F(X)$ is strictly concave in $X$). By differentiating, we then obtain:

$$\nabla_X F(X) = A - \frac{\exp(X)}{1 + \operatorname{tr}\exp(X)}. \tag{35}$$

Thus, if $A \succ 0$ and $\operatorname{tr}[A] < 1$, the equation $\nabla_X F(X) = 0$ always admits a (necessarily unique) solution given by:

$$X^* = \log A + \log(1 + \chi) I, \tag{36}$$

with $\chi = \operatorname{tr}\exp(X^*)$. Moreover, setting $a = \operatorname{tr}A$ and tracing (35) readily yields $\chi = a/(1 - a)$, so, after some easy
algebra, the maximum value of $F$ will be:

$$F_{\max} = F(X^*) = \operatorname{tr}[A\log A] + (1 - a)\log(1 - a). \tag{37}$$

The above establishes (32) for the case $A \succ 0$ and $\operatorname{tr}[A] < 1$; the boundary cases $\det A = 0$ and/or $\operatorname{tr}[A] = 1$ then
follow by continuity.

Thus, returning to (31), an immediate application of (32) gives:

$$\operatorname{tr}\left[Q^* U(T)\right] - P\log\left[1 + \operatorname{tr}\exp(U(T))\right] \leq 0, \tag{38a}$$

$$\operatorname{tr}\left[Q(t)U(t)\right] - P\log\left[1 + \operatorname{tr}\exp(U(t))\right] = P \cdot \Big[\operatorname{tr}\left[A(t)\log A(t)\right] + (1 - a(t))\log(1 - a(t))\Big], \tag{38b}$$

where we have set $A(t) = Q(t)/P$ and $a(t) = \operatorname{tr}A(t)$. As for (38b), its RHS can be expressed more concisely as the
(negative) von Neumann quantum entropy of the augmented matrix $A_0(t) = \operatorname{diag}(1 - a(t), A(t))$, i.e.

$$\operatorname{tr}\left[A(t)\log A(t)\right] + (1 - a(t))\log(1 - a(t)) = \operatorname{tr}\left[A_0(t)\log A_0(t)\right] \geq -\log(1 + KM), \tag{39}$$

where the last inequality simply corresponds to the maximum value of the von Neumann entropy (recall also that
$\dim(A_0) = 1 + KM$) [48]. Thus, substituting (38) and (39) back into (31), we finally obtain:

$$\int_0^T \operatorname{tr}\left[(Q(t) - Q^*) \cdot V(t)\right] dt \leq \frac{P\log(1 + KM)}{\eta(0)} - P\log(1 + KM)\int_0^T \frac{\dot\eta(t)}{\eta(t)^2}\,dt = \frac{P\log(1 + KM)}{\eta(T)}, \tag{40}$$

where we have used the fact that $\dot\eta \leq 0$ (recall that $\eta$ has been assumed nonincreasing). The regret bound (24) then
follows by maximizing (25) over all $Q^* \in \mathcal{X}$. ■
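The key inequality (32) and its equality case (33) lend themselves to a quick numerical sanity check. The sketch below samples random Hermitian test matrices (the sampling scheme, the dimension `d = 4`, the trace level 0.7 and all helper names are assumptions of the example) and evaluates both sides of (32) through eigenvalue decompositions.

```python
import numpy as np

def random_hermitian(rng, d):
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (g + g.conj().T) / 2

def lhs(A, X):
    """tr[AX] - log(1 + tr exp(X)), the left-hand side of (32)."""
    w = np.linalg.eigvalsh(X)
    return np.trace(A @ X).real - np.log1p(np.exp(w).sum())

def rhs(A):
    """tr[A log A] + (1 - tr A) log(1 - tr A), the right-hand side of (32)."""
    lam = np.linalg.eigvalsh(A)
    a = lam.sum()
    return float(np.sum(lam * np.log(lam)) + (1 - a) * np.log(1 - a))

rng = np.random.default_rng(1)
d = 4
gaps = []
for _ in range(200):
    B = random_hermitian(rng, d)
    A = B @ B.conj().T + 1e-3 * np.eye(d)   # positive definite
    A = 0.7 * A / np.trace(A).real          # rescaled so that tr A = 0.7 < 1
    X = random_hermitian(rng, d)
    gaps.append(rhs(A) - lhs(A, X))         # nonnegative if (32) holds

# Equality case (33): A = exp(X) / (1 + tr exp(X)).
X = random_hermitian(rng, d)
w, U = np.linalg.eigh(X)
A_eq = (U * (np.exp(w) / (1 + np.exp(w).sum()))) @ U.conj().T
equality_gap = abs(rhs(A_eq) - lhs(A_eq, X))
```

Here `min(gaps)` should be nonnegative and `equality_gap` should vanish up to floating-point error, in line with (32) and (33).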


B. No regret in discrete time: the case of perfect CSI 

We now return to the discrete-time process (MXL), written here in the more general form:

$$Y(n) = -\sum_{m=1}^{n} V(m), \qquad Q(n+1) = P\,\frac{\exp\left(\eta(n)Y(n)\right)}{1 + \operatorname{tr}\left[\exp\left(\eta(n)Y(n)\right)\right]}, \tag{41}$$

with $\eta(n) = \eta n^{-1/2}$ for some positive parameter $\eta > 0$. To establish the regret bound (19) of Theorem 1, we will define
an interpolated continuous-time process, use Proposition 3 to estimate the incurred regret in continuous time, and use
a discrete-continuous comparison argument in order to bound the regret in discrete time.
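For concreteness, the recursion (41) can be sketched in a few lines. This is only an illustration under stated assumptions: the gradient matrices of (17) are replaced by a toy stream of linear losses $\ell(Q; n) = \operatorname{tr}[C_n Q]$ (so that $V(n) = C_n$), the feasible set is taken to be $\{Q \succeq 0, \operatorname{tr}Q \leq P\}$, and the names `mxl_map` and `run_mxl` are hypothetical. The exponentiation step is evaluated through an eigendecomposition with a shift to avoid overflow.

```python
import numpy as np

def mxl_map(Y, eta, P):
    """Exponentiation step of (41): P*exp(eta*Y) / (1 + tr exp(eta*Y))."""
    w, U = np.linalg.eigh(eta * Y)
    shift = max(w.max(), 0.0)                 # avoids overflow in exp
    e = np.exp(w - shift)
    weights = e / (np.exp(-shift) + e.sum())
    return P * (U * weights) @ U.conj().T

def run_mxl(gradients, eta, P):
    """Run (41) on a stream of Hermitian gradient matrices V(1), ..., V(T)."""
    d = gradients[0].shape[0]
    Y = np.zeros((d, d), dtype=complex)
    Q = mxl_map(Y, eta, P)                    # Q(1) = P*I/(1+d) since Y(0) = 0
    played = []
    for n, V in enumerate(gradients, start=1):
        played.append(Q)                      # Q(n) is played against V(n)
        Y = Y - V                             # Y(n) = Y(n-1) - V(n)
        Q = mxl_map(Y, eta / np.sqrt(n), P)   # Q(n+1) with eta(n) = eta*n^{-1/2}
    return played

# Toy stream (assumption): random symmetric C_n with spectral norm 0.5.
rng = np.random.default_rng(2)
d, T, P, eta = 4, 400, 1.0, 1.0
grads = []
for _ in range(T):
    g = rng.standard_normal((d, d))
    c = (g + g.T) / 2
    grads.append(0.5 * c / np.linalg.norm(c, 2))
played = run_mxl(grads, eta, P)

# Regret against the best fixed Q in {Q >= 0, tr Q <= P} for linear losses.
alg_loss = sum(np.trace(C @ Q).real for C, Q in zip(grads, played))
S = sum(grads)
best_fixed = P * min(0.0, np.linalg.eigvalsh(S).min())
regret = alg_loss - best_fixed
```

For linear losses the best fixed matrix in hindsight has a closed form (all power on the eigenvector of the smallest eigenvalue of $\sum_n C_n$ if that eigenvalue is negative, zero power otherwise), which is why this toy case is convenient for comparing `regret` with the guarantees of Theorem 1.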

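Proof of Theorem 1: We begin by constructing a continuous-time interpolation of (MXL) and comparing it to its
discrete-time analogue. To that end, consider the continuous-time processes $V^c(t) = V(\lceil t \rceil)$ and $\eta^c(t) = \eta(\lfloor t \rfloor)$ for all
$t \geq 0$, with $\eta^c(0) = \eta(0) = \eta$ by convention.⁸ In this context, the continuous-time learning scheme (MXL-c) yields the
processes:

$$Y^c(t) = -\int_0^t V^c(s)\,ds, \qquad Q^c(t) = P\,\frac{\exp\left(\eta^c(t)Y^c(t)\right)}{1 + \operatorname{tr}\left[\exp\left(\eta^c(t)Y^c(t)\right)\right]}, \tag{42}$$

whence we easily obtain:

$$Y^c(n) = -\int_0^n V^c(s)\,ds = -\sum_{m=1}^{n}\int_{m-1}^{m} V(m)\,ds = -\sum_{m=1}^{n} V(m) = Y(n), \qquad Q^c(n) = Q(n+1). \tag{43}$$

⁸ Note that $V^c(t) = V(n)$ and $\eta^c(t) = \eta(n-1)$ for all $t \in (n-1, n)$, i.e. $V^c(t)$ precedes its discrete-time analogue, while $\eta^c(t)$ lags behind it; this
one-step offset will be key in the rest of our proof.

Consequently, for all $n \geq 1$ and for all $t \in (n-1, n)$, Hölder’s inequality yields:

$$\left|\operatorname{tr}\left[(Q^c(t) - Q(n)) \cdot V^c(t)\right]\right| = \left|\operatorname{tr}\left[(Q^c(t) - Q^c(n-1)) \cdot V^c(t)\right]\right| \leq \|V^c(t)\| \cdot \|Q^c(t) - Q^c(n-1)\|_{*} \leq V \cdot \|Q^c(t) - Q^c(n-1)\|_{*}. \tag{44}$$

Using the analysis of [31], it can be shown that the map $U \mapsto P\exp(U)/[1 + \operatorname{tr}\exp(U)]$ is $(P/2)$-Lipschitz with respect
to the spectral norm $\|\cdot\|$ and the nuclear norm $\|\cdot\|_{*}$ (for the map’s domain and codomain respectively). We may thus write:

$$\|Q^c(t) - Q^c(n-1)\|_{*} \leq \tfrac{1}{2}P\,\|\eta^c(t)Y^c(t) - \eta^c(n-1)Y^c(n-1)\| = \tfrac{1}{2}\eta(n-1)P\,\|Y^c(t) - Y^c(n-1)\|. \tag{45}$$

Furthermore, by definition, we also have:

$$\|Y^c(t) - Y^c(n-1)\| = \left\|-\int_0^t V^c(s)\,ds + \int_0^{n-1} V^c(s)\,ds\right\| \leq \int_{n-1}^{t}\|V^c(s)\|\,ds \leq V \cdot (t - n + 1), \tag{46}$$

and hence, by combining (44), (45) and (46), we get:

$$\left|\operatorname{tr}\left[(Q^c(t) - Q(n)) \cdot V^c(t)\right]\right| \leq \tfrac{1}{2}P\,\|V(n)\|^2\,\eta(n-1) \cdot (t - n + 1). \tag{47}$$

Accordingly, with this discrete/continuous comparison result at hand, we get:

$$\begin{aligned}
\int_0^T \operatorname{tr}\left[Q^c(t)V^c(t)\right] dt - \sum_{n=1}^{T}\operatorname{tr}\left[Q(n)V(n)\right]
&= \sum_{n=1}^{T}\int_{n-1}^{n}\operatorname{tr}\left[Q^c(t)V^c(t)\right] dt - \sum_{n=1}^{T}\operatorname{tr}\left[Q(n)V(n)\right] \\
&= \sum_{n=1}^{T}\int_{n-1}^{n}\Big[\operatorname{tr}\left(Q^c(t)V^c(t)\right) - \operatorname{tr}\left(Q(n)V(n)\right)\Big] dt \\
&\leq \sum_{n=1}^{T}\int_{n-1}^{n}\Big|\operatorname{tr}\left[Q^c(t)V^c(t)\right] - \operatorname{tr}\left[Q(n)V^c(t)\right]\Big|\,dt \\
&\leq \sum_{n=1}^{T}\int_{n-1}^{n}\frac{PV^2}{2}\,\eta(n-1)(t - n + 1)\,dt \leq \frac{PV^2}{4}\sum_{n=1}^{T}\eta(n-1),
\end{aligned} \tag{48}$$

for all $T \geq 1$; the same estimate holds with the roles of the two terms on the LHS reversed, since the third line bounds the absolute value of their difference. Thus, using Proposition 3 and the convexity condition (25), we obtain:

$$\begin{aligned}
\sum_{n=1}^{T}\left[\ell(Q(n); n) - \ell(Q^*; n)\right]
&\leq \sum_{n=1}^{T}\operatorname{tr}\left[(Q(n) - Q^*)V(n)\right] \\
&\leq \int_0^T \operatorname{tr}\left[Q^c(t)V^c(t)\right] dt + \frac{PV^2}{4}\sum_{n=1}^{T}\eta(n-1) - \int_0^T \operatorname{tr}\left[Q^* V^c(t)\right] dt \\
&= \int_0^T \operatorname{tr}\left[(Q^c(t) - Q^*)V^c(t)\right] dt + \frac{PV^2}{4}\sum_{n=1}^{T}\eta(n-1) \\
&\leq \frac{P\log(1 + KM)}{\eta(T)} + \frac{PV^2}{4}\sum_{n=1}^{T}\eta(n-1),
\end{aligned} \tag{49}$$

where we used (48) and the fact that $\sum_{n=1}^{T}\operatorname{tr}\left[Q^* V(n)\right] = \int_0^T \operatorname{tr}\left[Q^* V^c(t)\right] dt$ in the second line. Thus, substituting
$\eta(n) = \min\{\eta n^{-1/2}, \eta\}$, the last term of (49) becomes

$$\sum_{n=1}^{T}\eta(n-1) = \eta + \sum_{n=1}^{T-1}\eta n^{-1/2} \leq \eta + \int_0^T \eta t^{-1/2}\,dt \leq \eta\left(1 + 2\sqrt{T}\right), \tag{50}$$

and our proof is completed by substituting in (49) and maximizing over all $Q^* \in \mathcal{X}$.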
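The step-size estimate (50) and the resulting decay of the average regret guarantee can be checked with simple arithmetic. In the sketch below, the constant $PV^2/4$ follows the reading of (48)-(49) used in this appendix (Lipschitz constant $P/2$ integrated over a unit interval), the value `dim = 16` corresponds to $KM$ with $K = 8$ subcarriers and $M = 2$ transmit antennas, and the function names and remaining parameter values are illustrative assumptions.

```python
import math

def eta_seq(n, eta):
    """Step-size eta(n) = min(eta * n^{-1/2}, eta), with eta(0) = eta."""
    return eta if n == 0 else min(eta / math.sqrt(n), eta)

def step_sum(T, eta):
    """Left-hand side of (50): sum_{n=1}^{T} eta(n-1)."""
    return sum(eta_seq(n - 1, eta) for n in range(1, T + 1))

def regret_bound(T, eta, P, V, dim):
    """Right-hand side of (49), with the sum replaced by the estimate (50)."""
    entropy_term = P * math.log(1 + dim) * math.sqrt(T) / eta
    gradient_term = (P * V ** 2 / 4) * eta * (1 + 2 * math.sqrt(T))
    return entropy_term + gradient_term

eta = 0.5
horizons = (1, 10, 100, 1000)
# Each tuple holds (T, exact sum, closed-form estimate from (50)).
checks = [(T, step_sum(T, eta), eta * (1 + 2 * math.sqrt(T))) for T in horizons]
# Average guarantee Reg(T)/T for increasing horizons.
avg_bounds = [regret_bound(T, eta, P=1.0, V=1.0, dim=16) / T for T in (10, 100, 1000)]
```

The exact sums in `checks` stay below the closed-form estimate, and `avg_bounds` decreases roughly like $T^{-1/2}$, which is the rate behind the no-regret property.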
C. No regret in discrete time: the case of imperfect CSI

To prove Theorem 2, we will use Eq. (48) to bound the user’s “virtual” regret with respect to the sequence of noisy
gradient estimates $\hat V(n)$, and we will then employ the Borel-Cantelli lemma to show that the user’s actual regret lies
within a vanishing window of their “virtual” regret.

Proof of Theorem 2: As usual, the user’s regret is bounded by:

$$\operatorname{Reg}(T) = \max_{Q^* \in \mathcal{X}}\sum_{n=1}^{T}\left[\ell(Q(n); n) - \ell(Q^*; n)\right] \leq \max_{Q^* \in \mathcal{X}}\sum_{n=1}^{T}\operatorname{tr}\left[(Q(n) - Q^*) \cdot V(n)\right], \tag{51}$$

so, for the first part of the theorem, it suffices to show that $\sum_{n=1}^{T}\operatorname{tr}\left[(Q(n) - Q^*) \cdot V(n)\right] = o(T)$ for all $Q^* \in \mathcal{X}$. To that
end, given that $V(n) = \hat V(n) - Z(n)$, we have:

$$\operatorname{tr}\left[(Q(n) - Q^*) \cdot V(n)\right] = \operatorname{tr}\left[(Q(n) - Q^*) \cdot \hat V(n)\right] - \operatorname{tr}\left[(Q(n) - Q^*) \cdot Z(n)\right], \tag{52}$$

where $Q(n)$ is defined via the stochastic recursion:

$$Y(n) = Y(n-1) - \hat V(n), \qquad Q(n+1) = P\,\frac{\exp\left(\eta n^{-1/2}Y(n)\right)}{1 + \operatorname{tr}\left[\exp\left(\eta n^{-1/2}Y(n)\right)\right]}. \tag{53}$$

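Going back to the proof of Thm. 1, we may then use the last inequality of (48) to rewrite (49) as:

$$\sum_{n=1}^{T}\operatorname{tr}\left[(Q(n) - Q^*) \cdot \hat V(n)\right] \leq \frac{P\log(1 + KM)}{\eta(T)} + \frac{P}{4}\sum_{n=1}^{T}\eta(n-1)\,\|\hat V(n)\|^2. \tag{54}$$

The last term of (54) can then be bounded as:

$$\begin{aligned}
\sum_{n=1}^{T}\eta(n-1)\,\|\hat V(n)\|^2
&\leq \sum_{n=1}^{T}\eta(n-1)\left[\|V(n)\|^2 + 2\|V(n)\| \cdot \|Z(n)\| + \|Z(n)\|^2\right] \\
&\leq V^2\sum_{n=1}^{T}\eta(n-1) + O\!\left(\sum_{n=1}^{T}\eta(n-1)\,\|Z(n)\|^2\right),
\end{aligned} \tag{55}$$

where we have used the triangle inequality in the first line. We now claim that

$$\frac{1}{T}\sum_{n=1}^{T}\eta(n-1)\,\|Z(n)\|^2 \to 0 \quad \text{as } T \to \infty \text{ (a.s.)}. \tag{56}$$

Indeed, if we let $z(n) = \|Z(n)\|$, Hypothesis (H2) implies that $\mathbb{P}\left(z(n) \geq n^{1/4 - \varepsilon}\right) = O(1/n^{\beta})$ for some $\beta > 1$ and for all
small enough $\varepsilon > 0$. We thus obtain:

$$\sum_{n=1}^{\infty}\mathbb{P}\left(z(n) \geq n^{1/4 - \varepsilon}\right) = O\!\left(\sum_{n=1}^{\infty} n^{-\beta}\right) < \infty, \tag{57}$$

and hence, by the Borel-Cantelli lemma, we conclude that

$$\mathbb{P}\left(z(n) \geq n^{1/4 - \varepsilon} \text{ for infinitely many } n\right) = 0. \tag{58}$$

In turn, this implies that $z(n)^2 = O\left(n^{1/2 - 2\varepsilon}\right)$ almost surely, so, with $\eta(n) = \eta n^{-1/2}$, we get:

$$\sum_{n=1}^{T}\eta(n-1)\,\|Z(n)\|^2 = O\!\left(\sum_{n=1}^{T} n^{-1/2}\,n^{1/2 - 2\varepsilon}\right) = O\!\left(\sum_{n=1}^{T} 1/n^{2\varepsilon}\right) = o(T) \quad \text{(a.s.)}. \tag{59}$$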


For the second term of (52), let $\xi(n) = \operatorname{tr}\left[(Q(n) - Q^*) \cdot Z(n)\right]$. Then, given that $Q(n)$ is fully determined by $Q(n-1)$
and $Z(n-1)$, it follows that $\mathbb{E}\left[\xi(n) \mid Q(n-1)\right] = 0$, i.e. $\xi(n)$ is a martingale difference sequence; as a result, we get
$\lim_{T\to\infty} T^{-1}\sum_{n=1}^{T}\xi(n) = 0$ by the strong law of large numbers for martingale differences - see e.g. Theorem 2.18 in
[43]. Combining this with (59), we then get

$$\sum_{n=1}^{T}\operatorname{tr}\left[(Q(n) - Q^*) \cdot V(n)\right] = o(T) \quad \text{(a.s.)}, \tag{60}$$

i.e. (53) leads to no regret, as claimed. The mean bound (23) is then obtained by taking expectations on both sides of
(54) and recalling that $\mathbb{E}\left[\hat V(n) \mid Q(n-1)\right] = V(n)$. ■